THE AVERAGE READING VOCABULARY; AN APPLICATION OF

BAYES'S THEOREM. 

By WARREN WEAVER, University of Wisconsin. 

The rule for the computation of a posteriori probabilities was first developed 
by an English clergyman, T. Bayes, and was published after his death in the 
Philosophical Transactions for 1763. The careless application of this rule has 
led to many paradoxical results,^1 in consequence of which some mathematicians
would abandon the rule entirely. Among this number may be mentioned Mr. 
J. Bing, a Danish actuary, the late Dr. T. Thiele, and especially Professor 
Chrystal, whose advice is to "bury the laws of inverse probability decently out 
of sight." The problem herein stated and solved may be of interest since it 
clearly emphasizes the point the neglect of which has led to incorrect results, 
since it shows what great allowances may sometimes be made in the a priori 
probabilities of existence and still allow us to change our view in regard to a 
statistical result, and since it is an answer to a specific case of that interesting 
question, to what degree does new experimental evidence justify us in modifying 
previously held opinions. 

1. Statement of the theorem. Some one of the mutually exclusive causes
$A_1, A_2, \dots, A_n$ is to produce an event. When the result is not known (i.e.,
before the event occurs) the existence probability for each cause is
$\pi_1, \pi_2, \dots, \pi_n$ (i.e., the a priori probability of the existence of each
cause). The event in question occurs. The cause $A_1$, when it is known to act,
gives the productive probability $p_1$, etc. Then the a posteriori probability
that the cause $A_r$ produced the event is

$$P_r = \frac{p_r \pi_r}{p_1 \pi_1 + p_2 \pi_2 + \cdots + p_r \pi_r + \cdots + p_n \pi_n}. \tag{1}$$
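
Equation (1) can be tried numerically. The following is a minimal sketch in Python, not part of the original paper; the function name `posterior` and the three sets of figures are hypothetical and chosen only for illustration.

```python
def posterior(priors, productive):
    """Return the a posteriori probability of each cause, as in equation (1)."""
    # Numerator p_r * pi_r for every cause.
    joint = [pi * p for pi, p in zip(priors, productive)]
    # Denominator: the sum of all such products.
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical figures for three causes; they are not taken from the text.
priors = [0.5, 0.3, 0.2]       # existence probabilities pi_r
productive = [0.1, 0.4, 0.8]   # productive probabilities p_r
post = posterior(priors, productive)
print(post)
```

With these assumed figures the cause that was least probable a priori becomes the most probable a posteriori, because its productive probability is the largest.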



If the event in question is able to occur in two alternative ways one of which 
we call "successful," and the other of which "unsuccessful"; and if, further, the 
productive probability that the cause $A_r$ produce the event successfully be $w_r$,
then if the event occurs successfully m times in k trials the a posteriori probability
that the cause $A_r$ has been the one to act is

$$P_r = \frac{\pi_r w_r^m (1 - w_r)^{k-m}}{\sum_{t=1}^{n} \pi_t w_t^m (1 - w_t)^{k-m}} \qquad (t = 1, 2, \dots, n). \tag{2}$$
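
The second form of the theorem may be sketched in the same way. The Python below is an illustration only; the name `posterior_after_trials`, the three success probabilities, and the counts m = 7, k = 10 are assumptions made for the example.

```python
def posterior_after_trials(priors, success_probs, m, k):
    """A posteriori probability of each cause after m successes in k trials."""
    # pi_t * w_t^m * (1 - w_t)^(k - m); the binomial coefficient is common
    # to every cause and cancels between numerator and denominator.
    weights = [pi * w ** m * (1 - w) ** (k - m)
               for pi, w in zip(priors, success_probs)]
    total = sum(weights)
    return [wt / total for wt in weights]


# Hypothetical example: three causes, equally probable a priori.
priors = [1 / 3, 1 / 3, 1 / 3]
success_probs = [0.2, 0.5, 0.8]
post = posterior_after_trials(priors, success_probs, m=7, k=10)
print(post)
```

Since the a priori probabilities are equal in this assumed case, the ordering of the results is fixed by the observed trials alone.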

The theorem may assume a third form in problems of such a nature that the
different causes $A_r$ may be considered different stages of a continuously changing
complex. In this case the quantities involved in the formula become definite
integrals.^2

1 For example, Bing's Paradox: "If among a large group of S equally old persons we have
observed no deaths during a full calendar year, then another person of the same age outside the
group is certain to die inside the calendar year" quoted from The Mathematical Theory of
Probabilities, by A. Fisher, Volume 1, New York, 1915, p. 75.

2 A. Fisher, l.c., p. 67. 



348 THE AVERAGE READING VOCABULARY. [Oct., 

To those not entirely familiar with the theorem an example may make the 
statement of it more clear. Suppose that we have an urn filled with black and 
white balls in unknown proportion, and that our a priori estimate of the existence
probability that there are x white and (b - x) black is $\pi_x$ (b being the total
number of balls in the urn). Suppose that we draw k balls from the urn, returning
each, and find that of these m are white and k - m black. What is then the
most probable mixture in the urn? It will not be the one originally most probable,
that is, the one for which $\pi_x$ is a maximum; nor will it be the one suggested by the
drawing,^3 but will obviously be some mixture intermediate between these two.
It will be, in fact, a mixture of x* white and (b - x*) black, where x* is that value
of x for which

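$$P_x = \frac{\pi_x \left(\frac{x}{b}\right)^m \left(\frac{b-x}{b}\right)^{k-m}}{\sum_{x'} \pi_{x'} \left(\frac{x'}{b}\right)^m \left(\frac{b-x'}{b}\right)^{k-m}} \qquad (x = 1, 2, \dots, b) \tag{3}$$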



is a maximum. In case $\pi_x$ is given, by experiment, judgment, or calculation, for
a certain finite number of values of x, we might determine this value x* by plotting 
to any scale whatsoever, and noting the value of x corresponding to the highest 
point on the curve. If we should later wish the vertical scale of this curve we 
could most easily determine it from the fact that the area under it must equal unity. 
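
The urn example may be made concrete as follows. This is a sketch under stated assumptions: an urn of b = 100 balls, a bell-shaped a priori estimate centred on 30 white balls, and a drawing of k = 20 balls of which m = 14 are white. None of these figures comes from the text.

```python
import math

b = 100          # total number of balls (assumed)
k, m = 20, 14    # balls drawn and white balls found (assumed)
xs = range(0, b + 1)

# Assumed a priori estimate, plotted "to any scale whatsoever":
# a bell-shaped curve with its maximum at x = 30.
prior = [math.exp(-((x - 30) ** 2) / (2 * 10 ** 2)) for x in xs]

# Numerator of equation (3) for every possible mixture.
weights = [p * (x / b) ** m * ((b - x) / b) ** (k - m)
           for p, x in zip(prior, xs)]

# The vertical scale is fixed by requiring the total to be unity.
total = sum(weights)
post = [w / total for w in weights]

prior_mode = max(xs, key=lambda x: prior[x])   # originally most probable
sample_mode = round(b * m / k)                 # suggested by the drawing
post_mode = max(xs, key=lambda x: post[x])     # most probable a posteriori
print(prior_mode, post_mode, sample_mode)
```

Under these assumptions the most probable mixture a posteriori falls between the mixture originally most probable and the one suggested by the drawing.
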

2. Statement of the problem. A test has been devised by the department of
educational psychology at the University of Wisconsin to determine a person's 
reading vocabulary. The process consists of taking at random 200 words from 
the dictionary, and having a person decide with how many of the 200 words he 
is familiar — say 117. Then the value of this person's reading vocabulary is taken 
as 

(117/200) 104,000 

or about 61,000. (104,000 being the approximate number of words in the 
dictionary.) 

The scheme has been found to give, as the result of about five hundred tests 
by university students, the value given above. This value is far in excess of 
previous estimates, the general opinion before this test being, according to 
Professor Starch, that the correct figure was in the neighborhood of 25,000. 
We wish to investigate whether this process gives us a sound basis for raising our 
previous estimate of 25,000 to 61,000, and if not, what our answer should be. 

Since, as will appear later, the a priori probabilities of different estimates have 
an important effect upon the solution of the problem it is necessary to inquire 
as carefully as the nature of the existing information will permit into the basis 
and reliability of this estimate of 25,000 words. Unfortunately the information 
is vague, but it appears to be an average opinion of those who had been interested 
in the matter, rather than a definite statistical result. There may have been, 
to be sure, some numerical method, however unsatisfactory, by which the 

2 Unless, of course, these two mixtures happen to be the same. 
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estimates were arrived at; or it may be that an investigator with a great deal 
of experience along this line would venture an estimate based upon intuition 
alone. It is evident as a mere matter of common sense that if this estimate of 
25,000 words were formed upon the basis of one application of this test itself, 
and the next application of the test furnished an estimate of 61,000, one would 
not be justified in completely discarding the old estimate for the new. It is 
obviously a matter of the comparative reliability of the new and old judgments, 
which comparison is made accurately by the theorem stated. 

3. Solution of the problem. The actual method used in taking the sample 
was to take the first word on the kth page of a Webster's Unabridged Dictionary,
k having such a value that the method would result in a sample of 200. There 
being so many words to pick from, it is evident that it is immaterial whether or 
not we consider that we return each word after its drawing: a consideration which 
in other cases might be important. It seems likely, on an intuitive basis, that if 
the same person performed the test several times with different samples, or if it 
were performed with several persons and the same sample, results would be obtained
that would vary widely, especially since 200 seems a small sample from a
group of 104,000. It should be emphasized therefore that the datum which we use
is the average of over five hundred results from different persons. And some 
knowledge of the variability of these results is important in making an estimate 
of the stability of the average. The following frequency table gives us an estimate 
of the variability in the results obtained from a typical group of fifty students, 
using the same sample of words. It is on the basis of 100 words rather than 200 
since, as a matter of procedure in making the test, the whole list was split up
into two lists of 100 each, and the score kept for each separately. 

No. of Words Known.   No. of Students.   No. of Words Known.   No. of Students.

        46                   0                   61                   1
        47                   0                   62                   3
        48                   1                   63                   2
        49                   1                   64                   3
        50                   2                   65                   0
        51                   0                   66                   1
        52                   3                   67                   3
        53                   1                   68                   4
        54                   2                   69                   0
        55                   3                   70                   5
        56                   1                   71                   0
        57                   3                   72                   1
        58                   4                   73                   0
        59                   1                   74                   0
        60                   5
                                                Total                50
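
The frequency table can be summarised numerically. The sketch below is an illustration, not part of the original paper; it reads the blank entries of the table as zero (which agrees with the stated total of 50) and takes the "mean variation" to be the mean absolute deviation from the mean, which is an assumption about the author's usage.

```python
# Number of students for each score out of 100 words; scores with no
# students are omitted.
counts = {
    48: 1, 49: 1, 50: 2, 52: 3, 53: 1, 54: 2, 55: 3, 56: 1, 57: 3,
    58: 4, 59: 1, 60: 5, 61: 1, 62: 3, 63: 2, 64: 3, 66: 1, 67: 3,
    68: 4, 70: 5, 72: 1,
}

n_students = sum(counts.values())
mean_score = sum(score * c for score, c in counts.items()) / n_students

# Assumed reading of "mean variation": mean absolute deviation.
mean_abs_dev = sum(abs(score - mean_score) * c
                   for score, c in counts.items()) / n_students

print(n_students, mean_score, round(mean_abs_dev, 2))
```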
The result shown is typical of all obtained, a mean variation of approximately
five being found among all the students tested. We conclude therefore that 117 
is an average of considerable stability, coming as it does from a group of over 
500 tests which show a relatively small variability. Unfortunately the data are
not available as to how many sets were used: the errors of sampling in the dictionary
would be reduced in the approximate ratio of 1 to $\sqrt{n}$ where n is the
number of sets used. However the accuracy of any one list as expressed by the 
mean variation between two lists of 100 words each is between two and three 
words. We are led then to this conclusion: that if we imagine the hypothetical 
case of "the average" university student examining a perfectly fair sample of 
200 words it seems reasonable to assume that he would find among these 117 
that he would know. And any conclusion we may draw from this hypothetical 
case will have a reliability of approximately $\sqrt{500} = 22.4$ times as much as it
would have if it actually came from but a single trial. The fact that 200 is 
indeed a small sample will later appear in this, that the most probable mixture, 
even though it may be many times more probable than other mixtures not in 
the immediate neighborhood of the most probable one, is, as a matter of fact, a 
mixture whose probability is very small. This apparent paradox is often met 
with in problems involving large numbers. 

A mere change in the wording of the problem makes the application of equation (2)
evident. We have an urn filled with b (= 104,000) words in unknown proportion
of "white" (known) words to "black" (unknown) words. From this
urn we draw 200 and find 117 white and 83 black. In other words the event in 
question occurs 200 times in one of two alternative ways, it occurring 117 times 
in the way which we may call successful. The probability that the dictionary 
contains a mixture of x known and b — x unknown words is then 



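$$P_x = \frac{\pi_x \left(\frac{x}{b}\right)^{117} \left(\frac{b-x}{b}\right)^{83}}{\sum_{x'} \pi_{x'} \left(\frac{x'}{b}\right)^{117} \left(\frac{b-x'}{b}\right)^{83}}, \tag{4}$$

or

$$P_x = K \pi_x \left(\frac{x}{b}\right)^{117} \left(\frac{b-x}{b}\right)^{83}, \quad K \text{ being a constant.} \tag{5}$$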

We see at once that it is impossible to determine the value of x for which 
this expression is a maximum without a knowledge of the character of the term $\pi_x$.
This is exactly the point at which errors often creep into applications of the
theorem. It is often assumed from the logical principle of insufficient reason
that $\pi_x$ is a constant: in other words since we know too little to form a judgment
it is assumed that the a priori probabilities of all causes are equal. The principle 
of insufficient reason leads, however, to notoriously paradoxical results. 

For the problem here considered, however, while the theoretically correct
result cannot be obtained without a knowledge of $\pi_x$, we can easily show that for
practical purposes our knowledge concerning it may be very limited and still
sufficient.

Consider the curve

$$P'_x = \left(\frac{x}{b}\right)^{117} \left(\frac{b-x}{b}\right)^{83} \tag{6}$$

where b = 104,000. This curve has its maximum value when

$$x = (117/200)\, b = 61,000 \tag{7}$$

and the maximum value is equal to 

$$P'_{61000} = 1.137 \times 10^{-59}. \tag{8}$$

However for x = 25,000

$$P'_{25000} = 2.207 \times 10^{-83}. \tag{9}$$
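
The ordinates of this curve are easily evaluated by machine. The following Python sketch is an illustration only; the names `curve` and `x_peak` are hypothetical. It uses the exact maximizing value (117/200) b = 60,840 rather than the rounded 61,000, and the digits so obtained need not agree exactly with the printed figures, although the orders of magnitude do.

```python
B = 104_000      # approximate number of words in the dictionary
M, K = 117, 200  # known words, and size of the sample


def curve(x):
    """Ordinate (x/b)^117 * ((b - x)/b)^83 of the curve considered above."""
    return (x / B) ** M * ((B - x) / B) ** (K - M)


# Exact location of the maximum, (117/200) * b.
x_peak = M * B // K

print(x_peak, curve(x_peak), curve(25_000))
```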

To obtain the value of x for which $P_x$ of equation (5) is a maximum we have to
multiply every ordinate of the curve $P'_x$ by $\pi_x$ (the existence probability of a
mixture of x known and b - x unknown words), and then take the value of x
corresponding to the highest point on this new curve. Therefore unless the a priori
existence probability of a mixture containing 25,000 known words exceeds the
a priori existence probability of a mixture containing 61,000 known words in the
ratio of $0.5 \times 10^{24}$ to 1 it is evident that the a posteriori probability of a mixture
characterized by x = 61,000 is greater than the a posteriori probability of a mixture
in which x is only 25,000. In fact we have

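$$\frac{P_{61000}}{P_{25000}} = \frac{K \pi_{61000} \times 1.137 \times 10^{-59}}{K \pi_{25000} \times 2.207 \times 10^{-83}} = 0.5 \times 10^{24}\, \frac{\pi_{61000}}{\pi_{25000}}, \tag{10}$$

which is greater than unity as long as

$$\pi_{25000} < 0.5 \times 10^{24} \times \pi_{61000}. \tag{11}$$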
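
The comparison of the two estimates amounts to multiplying the a priori odds by the ratio of the two ordinates. A minimal sketch in Python, working in common logarithms, follows; the function names and the two trial values of the a priori odds are assumptions introduced for illustration.

```python
import math

B = 104_000
M, K = 117, 200


def log10_curve(x):
    """Common logarithm of (x/b)^117 * ((b - x)/b)^83."""
    return M * math.log10(x / B) + (K - M) * math.log10((B - x) / B)


# Logarithm of the ratio of the ordinates at 61,000 and at 25,000.
log10_ratio = log10_curve(61_000) - log10_curve(25_000)


def log10_posterior_odds(log10_prior_odds):
    """Log odds of 61,000 against 25,000 after the test."""
    return log10_prior_odds + log10_ratio


# Hypothetical a priori opinion: 25,000 a million times as likely as 61,000.
print(log10_ratio, log10_posterior_odds(-6.0))
```

Even an a priori preference of a million to one for the old figure, as assumed here, leaves the new figure overwhelmingly the more probable after the test.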

While this may convince us that the previous estimate of 25,000 is to be discarded
it may not convince us that an estimate of, say, 50,000 is not as good a
new estimate as that of 61,000 which the test indicates. Let us therefore consider
the numerical magnitude of the ratio of the a posteriori probability of a
mixture for which x = 61,000 to the a posteriori probability of a mixture for which
x = 50,000. We have

$$\frac{P_{61000}}{P_{50000}} = \frac{\pi_{61000}}{\pi_{50000}} \left[\frac{61}{50}\right]^{117} \left[\frac{43}{54}\right]^{83} = 77\, \frac{\pi_{61000}}{\pi_{50000}},$$




which is greater than unity unless the a priori probability of a mixture of 50,000
known words exceeded the a priori probability of a mixture of 61,000 known words
by the factor 77. Since neither of these figures, 50,000 and 61,000, is in any way
special before the test is made there would seem no justifiable basis for considering
one more probable than the other in any such ratio as that just found. The
answer to our problem would then be that 61,000 is the most probable value
for the number of known words in the dictionary, this conclusion being reached
regardless of the character of $\pi_x$ outside of the one restriction stated in (11).
It is understood, of course, that the character of the term $\pi_x$ in the neighborhood of
x = 61,000 might shift the most probable value slightly, but $\pi_x$ would surely be
changing very slowly for values of x in this vicinity, and the shift would be therefore
very small, and negligible for practical purposes. Although the question
of whether equation (11) states a reasonable restriction upon the character of $\pi_x$ is
primarily one for educators to settle it would certainly seem sensible to assume
that it does. We should surely agree that the previous results were not sufficiently
well established that we could consider them, a priori, $0.5 \times 10^{24}$ times
as likely to be true as any other result. We must say "any other" result since
our estimate of the a priori probabilities, it being independent of the result of
the test and therefore for psychological reasons best formed before the test takes
place, could attach no special importance to the figure 61,000, a number which
is not known until after the test is performed. All we could say might be, for
example, that we consider a result lying between, say, 20,000 and 30,000 one
hundred times more likely than a result lying outside this band, and that we
consider it certain that the actual mixture contains more than 5,000 and less than
90,000 words that the average student knows. Such an assumption, coupled
with the fact that the total area under the curve $y = \pi_x$ must be unity gives

$$\pi_x = \begin{cases}
0, & 0 < x < 5000 \\
9.302 \times 10^{-7}, & 5000 < x < 20000 \\
9.302 \times 10^{-5}, & 20000 < x < 30000 \\
9.302 \times 10^{-7}, & 30000 < x < 90000 \\
0, & 90000 < x < 104000
\end{cases} \tag{12}$$
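
The heights in (12) follow from the unit-area condition alone. A short Python check is sketched below; the table `BANDS` and the function `prior_density` are names introduced here for illustration.

```python
# (lower limit, upper limit, relative height) for each band of x.
BANDS = [
    (5_000, 20_000, 1.0),
    (20_000, 30_000, 100.0),   # one hundred times more likely
    (30_000, 90_000, 1.0),
]

# Height outside the favoured band, fixed by requiring unit total area.
c = 1.0 / sum((hi - lo) * weight for lo, hi, weight in BANDS)


def prior_density(x):
    """A priori probability pi_x of a mixture with x known words."""
    for lo, hi, weight in BANDS:
        if lo < x < hi:
            return weight * c
    return 0.0


print(c, prior_density(25_000))
```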



Then we have $P_x$ given by the full line on the graph.

The vertical scale is again obtained from the fact that the area must equal
unity. The probability of the most probable mixture is found to be $9.946 \times 10^{-5}$,
a very small probability as was earlier suggested would be the case. The values
of $P_x$ outside the range shown are too small to be indicated.
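
The whole a posteriori curve can be computed directly on the integers x. The Python below is a sketch under the assumption that the a priori probabilities are those of (12) and that the unit-area condition may be replaced by a sum over integer values of x; logarithms are used only to keep the very small ordinates manageable. The peak so computed is of the order of $10^{-4}$ and need not reproduce the printed digits.

```python
import math

B = 104_000
M, K = 117, 200
C0 = 1.0 / 1_075_000   # height of the prior outside the favoured band


def prior(x):
    """A priori probability of x known words, following (12)."""
    if 20_000 < x < 30_000:
        return 100.0 * C0
    if 5_000 < x < 20_000 or 30_000 < x < 90_000:
        return C0
    return 0.0


def log_curve(x):
    """Natural logarithm of (x/b)^117 * ((b - x)/b)^83."""
    return M * math.log(x / B) + (K - M) * math.log((B - x) / B)


xs = [x for x in range(1, B) if prior(x) > 0.0]
log_w = [math.log(prior(x)) + log_curve(x) for x in xs]

# Scale by the largest term so that the exponentials do not underflow
# near the maximum, then normalise to unit total.
top = max(log_w)
weights = [math.exp(v - top) for v in log_w]

mode = xs[log_w.index(top)]      # most probable mixture
p_mode = 1.0 / sum(weights)      # its a posteriori probability
print(mode, p_mode)
```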

It is to be especially noted that this curve will approximately represent $P_x$
whatever the assumption concerning $\pi_x$, only provided, say, that

$$\pi_{25000} < 0.5 \times 10^{20}\, \pi_{61000}. \tag{13}$$

This condition is slightly more stringent than (11), and insures that $P_{25000}$ shall
be less than one ten thousandth of $P_{61000}$, and therefore negligible. It is easily
shown that in a small neighborhood of x = (117/200) b the equation

$$P_x = K \pi_x \left(\frac{x}{b}\right)^{117} \left(\frac{b-x}{b}\right)^{83}$$

is equivalent to

$$P_x = K \pi_x P'_{61000} \exp\left\{-\left[\frac{(117+83)^3}{2}\right]\left[\frac{1}{83} \cdot \frac{1}{117}\right] \frac{\epsilon^2}{b^2}\right\} \tag{14}$$

where 

$$\epsilon = (117/200)\, b - x. \tag{15}$$

This equation reduces to 

$$P_x = C e^{-3.81 \times 10^{-8}\, \epsilon^2} \tag{16}$$

if we assume, as we have done, that $\pi_x$ is constant in a small range about
x = 61,000. The constant C is evaluated by means of the fact that when
$\epsilon = 0$, $P_x$ must be the probability of a mixture containing 61,000 known words,
which probability we have previously calculated. Then

$$C = 9.946 \times 10^{-5}. \tag{17}$$

The size of the coefficient of $\epsilon^2$ in equation (16) indicates clearly how rapidly
probabilities diminish in the neighborhood of the most probable result. The 
approximation of equation (16) to equation (5) is shown on the graph. The 
dotted line is the graph of equation (16). 
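
The closeness of the exponential approximation can be checked numerically. The sketch below is an illustration only: it computes the coefficient of $\epsilon^2$ from the form of equation (14) with b = 104,000 and compares the logarithm of the exact ratio $P'_x / P'_{61000}$ with that of the approximation. The function names and the two trial values of $\epsilon$ are assumptions.

```python
import math

B = 104_000
M, K = 117, 200
x0 = M * B / K    # (117/200) * b, the location of the maximum

# Coefficient of eps^2: [(117 + 83)^3 / 2] * [(1/83)(1/117)] / b^2.
a = K ** 3 / (2 * M * (K - M) * B ** 2)


def exact_log_ratio(eps):
    """Logarithm of the exact ratio of ordinates at x0 - eps and at x0."""
    x = x0 - eps
    return M * math.log(x / x0) + (K - M) * math.log((B - x) / (B - x0))


def gauss_log_ratio(eps):
    """Logarithm of the exponential approximation."""
    return -a * eps ** 2


near = abs(exact_log_ratio(2_000) - gauss_log_ratio(2_000))
far = abs(exact_log_ratio(20_000) - gauss_log_ratio(20_000))
print(a, near, far)
```

The agreement is close in the neighborhood of the maximum and deteriorates farther away, as one would expect of an approximation valid for small $\epsilon$.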

We know from Bernoulli's theorem that if the dictionary actually consisted 
of x* words which we could characterize as known, and b - x* which would
accordingly be unknown, as we take more and more samples from it the estimate 
formed from these samples must approach the value x*. We have, indeed, for 
the ratio of the a posteriori probability of a mixture containing x known words 
to the a posteriori probability of a mixture containing 61,000 known words 



$$\frac{P_x}{P_{61000}} = \frac{\pi_x}{\pi_{61000}} \left[\frac{x}{61000}\right]^{m} \left[\frac{104000 - x}{43000}\right]^{k-m},$$



where m known words have appeared in a total sample of k. If m/k equals
61,000/104,000 the above ratio has its maximum when x = 61,000, in which
case it is obviously unity. For any other value of x this ratio may be greater
than unity for a given k, but must become and remain smaller than unity, as k
increases indefinitely whatever the (finite) ratio of $\pi_x$ to $\pi_{61000}$. The truth of
this statement is obvious from the form of the above equation. It thus appears
that the a priori probability is vanishingly unimportant as the number of trials
increases. If it were the case that 500 tests had been performed with the invariable
result of 117 known out of 200 chosen in each test we would have
k = 100,000 and m = 58,500. It is then clear that the above ratio would be
exceedingly small for any value of x other than 61,000 practically independently
of the ratio $\pi_x / \pi_{61000}$.
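
How quickly the evidence overwhelms the a priori ratio may be seen from the bracketed factors alone. A minimal sketch in Python follows, taking x = 50,000 as a trial value (an assumption made for illustration) and working in common logarithms, since the factor for k = 100,000 is far too small to represent directly.

```python
import math


def log10_evidence_factor(x, m, k, x_ref=61_000, b=104_000):
    """Common logarithm of [x/61000]^m * [(104000 - x)/43000]^(k - m)."""
    return (m * math.log10(x / x_ref)
            + (k - m) * math.log10((b - x) / (b - x_ref)))


single = log10_evidence_factor(50_000, m=117, k=200)
pooled = log10_evidence_factor(50_000, m=58_500, k=100_000)
print(single, pooled)
```

With a single test the factor is somewhat less than one hundredth; with the 500 tests pooled its logarithm is 500 times as large in magnitude, so that no finite a priori ratio of reasonable size could offset it.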

It is, however, not strictly admissible to make such a calculation in our case. 
For one thing the result in the 500 tests differed, even though with surprisingly
small variability. Even more important than this, however, is the fact that we
cannot in all strictness compare our problem with the analogous urn problem in
case the test is used on more than one person, as it of course was. For the content
of the dictionary, from our point of view, actually changes with the observer,
depending, as it does, upon how many words he knows. This is an added source 
of variability, over and beyond that which would occur due to the ordinary 
errors of sampling. The result of the above paragraph is still qualitatively 
applicable. 

Our final conclusion is, then, that the experimental evidence of the test completely
justifies us in abandoning the old result and accepting the new: and that,
moreover, probabilities of mixtures in the immediate neighborhood of the most 
probable mixture themselves follow the normal Gaussian law as given by 
equation (16). 



A GRAPHICAL AID IN THE STUDY OF FUNCTIONS OF A COMPLEX 

VARIABLE. 

By NORMAN MILLER, Queen's University. 

The impossibility in three dimensions of representing graphically a function of 
a complex variable makes it necessary for the student to call on his imagination 
in other ways in order to realize the properties of these functions. Two methods 
are common in the geometrical theory of functions. One is to represent in two 
different planes or in two Riemann surfaces the variables z and w and to study the 
correspondence between the points of the two planes or surfaces, which is determined
by the relation w = f(z). The second method, which does much to illuminate
the subject for the beginner, is to represent in one plane both the independent
and dependent variables and to interpret the transformation kinematically as
a flow of the points in the plane.^1

A complete graph of the function w = f(z) or u + iv = f(x + iy) consists of
a 2-dimensional manifold in space of four dimensions. Nevertheless the student,
in his effort to visualize the function, thinks instinctively of a surface spread out
over the plane of z. Such a surface is actually determined by taking for a third
coordinate the absolute value of f(z). Calling the third coordinate $\zeta$ the equation
of the surface is



$$\zeta = \sqrt{[u(x, y)]^2 + [v(x, y)]^2},$$

only the positive square root being taken. In this representation all points on a
circle with center at the origin in the w-plane yield the same ordinate $\zeta$. It is, in fact, by
making no distinction among the points of such a circle that we are able to pass
from a two-way spread in four dimensions to an actual surface in three dimensions.
It is interesting to enquire what properties of the function f(z) are exhibited

1 See in this connection an article by Cole, Annals of Mathematics, vol. 5, June, 1890. 



